[GPU] sdpa_micro for prefix caching #31968
`if (config.is_paged_attention && data_type_traits::is_i8_u8(K.data_type)) {`
Can't we use `config.is_kv_compressed` here?
`config.is_kv_compressed` is used for the non-PA case. I'm not sure when it is set, but from the code I see that it requires separate scale and zero-point (zp) inputs. So I didn't use `config.is_kv_compressed` for the PA case.
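To make the distinction in this thread concrete, here is a minimal standalone sketch of the two conditions being discussed. It is not the code in this PR: `SketchConfig`, `DataType`, `KvPath`, and `select_kv_path` are hypothetical names invented for illustration, and the ordering of the checks is an assumption. Only the two flags and the i8/u8 check on the K data type come from the diff and the comments above.

```cpp
// Hypothetical stand-ins for the real config and data types (assumptions).
enum class DataType { f16, f32, i8, u8 };

struct SketchConfig {
    bool is_paged_attention = false;  // PA case
    bool is_kv_compressed = false;    // non-PA case, expects separate scale/zp inputs
};

// Mirrors the intent of data_type_traits::is_i8_u8 from the diff.
constexpr bool is_i8_u8(DataType t) {
    return t == DataType::i8 || t == DataType::u8;
}

enum class KvPath {
    Plain,                      // no quantized KV handling
    PagedQuantized,             // PA case, detected from the K data type
    CompressedSeparateScaleZp   // non-PA case, separate scale and zp inputs
};

// Assumed ordering: the PA check on the K data type is evaluated first,
// then the non-PA is_kv_compressed flag.
inline KvPath select_kv_path(const SketchConfig& config, DataType k_type) {
    if (config.is_paged_attention && is_i8_u8(k_type)) {
        return KvPath::PagedQuantized;
    }
    if (config.is_kv_compressed) {
        return KvPath::CompressedSeparateScaleZp;
    }
    return KvPath::Plain;
}
```

Under these assumptions, a PA config with an i8 or u8 K tensor takes the `PagedQuantized` path without `is_kv_compressed` being set, which is the behavior the reply describes.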
Except a minor comment
Details:
- `sdpa_micro` to support paged attention for better performance.
- The `mixed` stage of paged attention will be handled by `sdpa_micro` instead of `pa_sdpa_opt`.
- `sdpa_micro` to support sliding window.

Tickets: